Abstract

We continue to discuss why MMSE estimation arises in coding schemes that approach
the capacity of linear Gaussian channels. Here we consider schemes that involve successive
decoding, such as decision-feedback equalization or successive cancellation.

"Everything should be made as simple as possible, but not simpler." (A. Einstein)

1 Introduction 

The occurrence of minimum-mean-squared-error (MMSE) linear estimation filters in constructive 
coding schemes that approach information-theoretic limits of linear Gaussian channels has been 
repeatedly observed, and justified by various arguments. For example, in an earlier paper [3] we
showed the necessity of the MMSE estimation factor in the capacity-approaching lattice coding 
scheme of Erez and Zamir [3] for the classic additive white Gaussian noise (AWGN) channel. 

In particular, MMSE decision-feedback equalizer (MMSE-DFE) filters have been used in 
coding schemes that approach the capacity of linear Gaussian intersymbol interference (ISI) 
channels, and generalized MMSE-DFE (MMSE-GDFE) filters have been used in coding
schemes that approach the capacity region of multiple-input, multiple-output (MIMO) linear
Gaussian channels. These successive decoding schemes combine "analog" discrete-time linear
MMSE estimation with the essentially "digital" assumption of ideal decision feedback (perfect 
prior decisions). 

The fact that MMSE filters allow information-theoretic limits to be approached in successive 
decoding scenarios is widely understood, and has been proved in various ways. Our aim here is 
to provide the simplest and most transparent justification possible. Some principal features of 
our approach are: 

• As in earlier work, we use a geometric Hilbert space formulation;

• Our results are based mainly on the sufficiency property of MMSE estimators, with 
information-theoretic results mostly as corollaries; 

• Proofs of almost all results are given. All proofs are brief and straightforward. 

In developing this approach, we have benefited from our earlier work with Cioffi et al. and
from the insightful development of Guess and Varanasi. We would also like to acknowledge
helpful comments on earlier drafts of this paper by G. Caire, J. Cioffi, U. Erez, T. Guess, S.
Shamai and G. Wornell.



1.1 Hilbert spaces of jointly Gaussian random variables 



All random variables in this note will be finite-variance, zero-mean, proper (circularly symmetric)
complex Gaussian random variables. Random variables will be denoted by capital letters such
as $X$. If the variance $\sigma^2$ of $X$ is nonzero, then $X$ has a probability density function (pdf)

$$p_X(x) = \frac{1}{\pi\sigma^2} \exp\left(-\frac{|x|^2}{\sigma^2}\right),$$

and thus its differential entropy is $h(X) = \mathrm{E}[-\log p_X(X)] = \log \pi e \sigma^2$. If the variance of $X$ is
zero, then $X$ is the deterministic zero variable 0.

Sets of such random variables will be denoted by a script letter such as $\mathcal{X} = \{X_i\}$. In this
paper, we will consider only finite sets of random variables. A particular application may involve
a finite set of such sets such as $\{\mathcal{X}, \mathcal{Y}, \mathcal{Z}\}$.

Whenever we have a set of Gaussian variables, their statistics will be assumed to be jointly 
Gaussian. A set of variables is jointly Gaussian if they can all be expressed as linear combinations 
of a common set of independent Gaussian random variables. It follows that any set of linear 
combinations of jointly Gaussian random variables is jointly Gaussian. 

The set of all complex linear combinations of a given finite set $\mathcal{X}$ of finite-variance, zero-mean,
proper jointly Gaussian complex random variables is evidently a complex vector space $Q$. Every
element of $Q$ is a finite-variance, zero-mean, proper complex Gaussian random variable, and
every subset of $Q$ is jointly Gaussian. The zero vector of $Q$ is the unique zero variable 0. The
dimension of $Q$ is at most the size $|\mathcal{X}|$ of $\mathcal{X}$.

It is well known that if an inner product is defined on $Q$ as the cross-correlation
$\langle X, Y \rangle = \mathrm{E}[XY^*]$, then $Q$ becomes a Hilbert space (a complete inner product space), a subspace of the
Hilbert space $\mathcal{H}$ consisting of all finite-variance zero-mean complex random variables. The
squared norm of $X \in Q$ is then its variance, $\|X\|^2 = \langle X, X \rangle = \mathrm{E}[|X|^2]$. Variances are real,
finite and strictly non-negative; i.e., if $X \in Q$ has zero variance, $\|X\|^2 = 0$, then $X$ must be the
deterministic zero variable, $X = 0$.

If $Q$ is generated by $\mathcal{X}$, then all inner products between elements of $Q$ are determined by the
inner product (autocorrelation) matrix $R_{xx} = \{\langle X, X' \rangle \mid X, X' \in \mathcal{X}\}$ (the Gram matrix of $\mathcal{X}$).
In other words, the matrix $R_{xx}$ completely determines the geometry of $Q$. Since all subsets of
variables in $Q$ are jointly Gaussian, the joint statistics of any such subset of $Q$ are completely
determined by their second-order statistics, and thus by $R_{xx}$.

A subset $\mathcal{Y} \subseteq Q$ is called linearly dependent if there is some nontrivial linear combination of the elements
of $\mathcal{Y}$ that is equal to the zero variable 0, and linearly independent otherwise. We will see that a
subset $\mathcal{Y} \subseteq Q$ is linearly independent if and only if its autocorrelation matrix $R_{yy}$ has full rank.

Two random variables are orthogonal if their inner product is zero; i.e., if they are uncorrelated.
If two jointly Gaussian variables are orthogonal, then they are statistically independent. The
only variable in $Q$ that is orthogonal to itself (i.e., satisfies $\langle X, X \rangle = 0$) is the zero variable 0.
If $\langle X, Y \rangle = 0$, then the Pythagorean theorem holds:

$$\|X + Y\|^2 = \|X\|^2 + \|Y\|^2.$$

Given any subset $\mathcal{Y} \subseteq Q$, the closure $\bar{\mathcal{Y}}$ of $\mathcal{Y}$, or the subspace generated by $\mathcal{Y}$, is the set of
all linear combinations of elements of $\mathcal{Y}$. Also, the set of all $X \in Q$ that are orthogonal to all
elements of $\mathcal{Y}$ is a subspace of $Q$, called the orthogonal subspace $\mathcal{Y}^\perp \subseteq Q$. Since 0 is the only
element of $Q$ that is orthogonal to itself, the only common element of $\bar{\mathcal{Y}}$ and $\mathcal{Y}^\perp$ is 0.

1.2 The projection theorem



The key geometric property of the Hilbert space $Q$ is the projection theorem: if $\mathcal{V}$ and $\mathcal{V}^\perp$
are orthogonal subspaces of $Q$, then for any $X \in Q$ there exist a unique $X_{|\mathcal{V}} \in \mathcal{V}$ and a unique
$X_{\perp\mathcal{V}} \in \mathcal{V}^\perp$ such that $X = X_{|\mathcal{V}} + X_{\perp\mathcal{V}}$. $X_{|\mathcal{V}}$ and $X_{\perp\mathcal{V}}$ are called the projections of $X$
onto $\mathcal{V}$ and $\mathcal{V}^\perp$, respectively.

An explicit formula for a projection $X_{|\mathcal{V}}$ such that $X - X_{|\mathcal{V}} \in \mathcal{V}^\perp$ will be given below. Uniqueness
is the most important part of the projection theorem, and may be proved as follows: if $X = Y + Z$
and also $X = Y' + Z'$, where $Y, Y' \in \mathcal{V}$ and $Z, Z' \in \mathcal{V}^\perp$, then

$$0 = \|X - X\|^2 = \|Y - Y'\|^2 + \|Z - Z'\|^2,$$

where the Pythagorean theorem applies since $Y - Y' \in \mathcal{V}$ and $Z - Z' \in \mathcal{V}^\perp$. Since norms are
non-negative, this implies $\|Y - Y'\|^2 = \|Z - Z'\|^2 = 0$, which implies $Y = Y'$ and $Z = Z'$.

The projection theorem is illustrated by the little "Pythagorean" diagram below. Since $X_{|\mathcal{V}}$
and $X_{\perp\mathcal{V}}$ are orthogonal, we have $\|X\|^2 = \|X_{|\mathcal{V}}\|^2 + \|X_{\perp\mathcal{V}}\|^2$.

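[Diagram: right triangle showing the orthogonal components $X_{|\mathcal{V}}$ and $X_{\perp\mathcal{V}}$ of $X$.]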

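If $\bar{\mathcal{Y}}$ is a subspace that is generated by a set of variables $\mathcal{Y}$, then with mild abuse of notation
we will write $X_{|\mathcal{Y}}$ and $X_{\perp\mathcal{Y}}$ rather than $X_{|\bar{\mathcal{Y}}}$ and $X_{\perp\bar{\mathcal{Y}}}$.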

1.3 Innovations representations 

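Let $\mathcal{X} \subseteq Q$ be a finite subset of elements of $Q$, and let $\bar{\mathcal{X}} \subseteq Q$ be the subspace of $Q$ generated by
$\mathcal{X}$. An orthogonal basis for $\bar{\mathcal{X}}$ may then be found by a recursive (Gram-Schmidt) decomposition,
as follows.

Denote the elements of the generator set $\mathcal{X}$ by $X_1, X_2, \ldots$, and let $\bar{\mathcal{X}}_1^{i-1}$ denote the subspace
of $Q$ generated by $\mathcal{X}_1^{i-1} = \{X_1, X_2, \ldots, X_{i-1}\}$. To initialize, set $i = 1$ and $\bar{\mathcal{X}}_1^0 = \{0\}$. For the $i$th
recursion, using the projection theorem, write $X_i$ uniquely as

$$X_i = (X_i)_{|\mathcal{X}_1^{i-1}} + (X_i)_{\perp\mathcal{X}_1^{i-1}}.$$

We have $(X_i)_{\perp\mathcal{X}_1^{i-1}} = 0$ if and only if $X_i \in \bar{\mathcal{X}}_1^{i-1}$. In this case $\bar{\mathcal{X}}_1^i = \bar{\mathcal{X}}_1^{i-1}$, so we can delete $X_i$
from the generator set $\mathcal{X}$ without affecting $\bar{\mathcal{X}}$. Otherwise, we can take the "innovation" variable
$E_i = (X_i)_{\perp\mathcal{X}_1^{i-1}} \neq 0$ as a replacement for $X_i$ in the generator set; the space generated by $\mathcal{X}_1^{i-1}$
and $E_i$ is still $\bar{\mathcal{X}}_1^i$, but $E_i$ is orthogonal to $\mathcal{X}_1^{i-1}$. By induction, the nonzero innovations variables
up to $E_i$ generate $\bar{\mathcal{X}}_1^i$ and are mutually orthogonal; i.e., they form an orthogonal basis for $\bar{\mathcal{X}}_1^i$.
This recursive decomposition thus shows that: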
This recursive decomposition thus shows that: 

• Any generator set $\mathcal{X}$ for a subspace $\bar{\mathcal{X}}$ contains a linearly independent generator set $\mathcal{X}' \subseteq \mathcal{X}$
that generates $\bar{\mathcal{X}}$. Therefore, without loss of generality, we may assume that any generator
set $\mathcal{X}$ for $\bar{\mathcal{X}}$ is linearly independent.

• Given a linearly independent generator set $\mathcal{X} = \{X_1, X_2, \ldots\}$ for $\bar{\mathcal{X}}$, we can find an orthogonal
basis $\mathcal{E} = \{E_1, E_2, \ldots\}$ for $\bar{\mathcal{X}}$, where $E_i = (X_i)_{\perp\mathcal{X}_1^{i-1}} = X_i - (X_i)_{|\mathcal{X}_1^{i-1}}$. Since $(X_i)_{|\mathcal{X}_1^{i-1}}$
is a linear combination of $X_1, X_2, \ldots, X_{i-1}$, if we write $\mathcal{X}$ and $\mathcal{E}$ as column vectors, then
we have

$$\mathcal{E} = L^{-1}\mathcal{X},$$

where $L^{-1}$ is a monic (i.e., having ones on the diagonal) lower triangular matrix. Since
$L^{-1}$ is square and has a monic lower triangular inverse $L$, we may write alternatively

$$\mathcal{X} = L\mathcal{E}.$$

We conclude that a finite set of random variables $\mathcal{X}$ is jointly Gaussian if and only if $\mathcal{X}$ can be
written as a monic lower triangular ("causal") linear transformation $\mathcal{X} = L\mathcal{E}$ of an orthogonal
innovations sequence $\mathcal{E}$. All innovations variables are nonzero (i.e., $\mathcal{E}$ is linearly independent) if
and only if $\mathcal{X}$ is linearly independent. This is called an innovations representation of $\mathcal{X}$.

Moreover, the expression $\mathcal{X} = L\mathcal{E}$ implies that the autocorrelation matrix of $\mathcal{X}$ is

$$R_{xx} = L R_{ee} L^* = L D^2 L^*,$$

where $L^*$ denotes the conjugate transpose of $L$ (a monic upper triangular matrix), and $R_{ee}$ is
a non-negative real diagonal matrix $D^2$, because $\mathcal{E}$ is an orthogonal sequence. This is called
a Cholesky decomposition of $R_{xx}$; the diagonal elements $\|E_i\|^2$ of $D^2$ are called the Cholesky
factors of $R_{xx}$. The Cholesky factors are all nonzero, and thus $R_{ee}$ and $R_{xx}$ have full rank, if
and only if $\mathcal{X}$ is linearly independent. In general, the rank of $R_{xx}$ is the number of nonzero
innovations variables $E_i$ in the innovations representation $\mathcal{X} = L\mathcal{E}$.

Since $L$ is monic lower triangular, its determinant is 1: $|L| = |L^*| = 1$. Therefore

$$|R_{xx}| = |R_{ee}| = \prod_i \|E_i\|^2.$$
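
As a numerical illustration (ours, not part of the development above), the innovations recursion can be carried out directly on the Gram matrix. The following minimal Python sketch assumes a full-rank Hermitian matrix given as a list of rows; the function name `innovations_factorization` and the example matrix are hypothetical. It returns a monic lower triangular factor and the innovation variances, whose product should equal the determinant of the Gram matrix.

```python
def innovations_factorization(R):
    """Factor a Hermitian Gram matrix R as L D L^*, with L monic lower triangular.

    R is a list of rows of (possibly complex) numbers. Returns (L, d), where
    d[i] plays the role of the innovation variance ||E_i||^2. Assumes R has
    full rank, i.e. the generator set is linearly independent.
    """
    n = len(R)
    L = [[1.0 if i == j else 0.0 for j in range(n)] for i in range(n)]
    d = [0.0] * n
    for i in range(n):
        for j in range(i):
            # <X_i, E_j> = R[i][j] - sum_{k<j} L[i][k] d[k] conj(L[j][k])
            s = R[i][j] - sum(L[i][k] * d[k] * L[j][k].conjugate() for k in range(j))
            L[i][j] = s / d[j]
        # ||E_i||^2 = ||X_i||^2 - sum_{k<i} |L[i][k]|^2 d[k]
        d[i] = (R[i][i] - sum(abs(L[i][k]) ** 2 * d[k] for k in range(i))).real
    return L, d


# Hypothetical 2x2 example: a Hermitian Gram matrix of two variables.
R = [[2, 1 + 1j], [1 - 1j, 3]]
L, d = innovations_factorization(R)
det_R = (R[0][0] * R[1][1] - R[0][1] * R[1][0]).real
# The product d[0] * d[1] of the innovation variances should match det_R.
```

A rank-deficient matrix would make some `d[j]` zero and the division fail; in that case the corresponding variable would first be deleted from the generator set, as described above.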



1.4 Differential entropy 

To find the differential entropy $h(\mathcal{X})$ of a set $\mathcal{X}$ of $N$ linearly independent jointly Gaussian
random variables, we first recall that the differential entropy of a complex Gaussian variable $X$
with variance $\|X\|^2 > 0$ is $h(X) = \log \pi e \|X\|^2$. Then we have

$$\begin{aligned}
h(\mathcal{X}) &= h(X_1) + h(X_2 \mid X_1) + \cdots + h(X_i \mid \mathcal{X}_1^{i-1}) + \cdots \\
&= h(E_1) + h(E_2) + \cdots + h(E_i) + \cdots \\
&= \log \pi e \|E_1\|^2 + \log \pi e \|E_2\|^2 + \cdots + \log \pi e \|E_i\|^2 + \cdots \\
&= \log (\pi e)^N |R_{ee}| \\
&= \log (\pi e)^N |R_{xx}|,
\end{aligned}$$

where we use the chain rule of differential entropy, we note that $E_i = X_i - (X_i)_{|\mathcal{X}_1^{i-1}}$ implies
$h(E_i) = h(X_i \mid \mathcal{X}_1^{i-1})$, and we apply the determinantal equalities that arise from the innovations
representation of $\mathcal{X}$.

Thus the differential entropy per complex dimension is

$$\frac{h(\mathcal{X})}{N} = \log \pi e |R_{xx}|^{1/N},$$

where $|R_{xx}|^{1/N}$ is the geometric mean of the Cholesky factors (or eigenvalues) of $R_{xx}$. Note that
this result is independent of the order in which we take the variables in $\mathcal{X}$.

1.5 Fundamentals of MMSE estimation theory



Suppose that $X$ represents a random variable to be estimated and that $\mathcal{Y}$ represents a set
of observed variables, where $X$ and $\mathcal{Y}$ are jointly Gaussian. A linear estimate of $X$ is a linear
function of $\mathcal{Y}$; i.e., a random variable $V$ in the space $\bar{\mathcal{Y}}$. The estimation error is then $E = X - V$.

By the projection theorem, the projection $X_{|\mathcal{Y}} \in \bar{\mathcal{Y}}$ minimizes the estimation error variance
$\|X - V\|^2$ over $V \in \bar{\mathcal{Y}}$, because, using the Pythagorean theorem and the fact that $X_{|\mathcal{Y}} - V \in \bar{\mathcal{Y}}$
while $X_{\perp\mathcal{Y}} \in \mathcal{Y}^\perp$, we have

$$\|X - V\|^2 = \|X_{|\mathcal{Y}} + X_{\perp\mathcal{Y}} - V\|^2 = \|X_{|\mathcal{Y}} - V\|^2 + \|X_{\perp\mathcal{Y}}\|^2 \ge \|X_{\perp\mathcal{Y}}\|^2,$$

with equality if and only if $V = X_{|\mathcal{Y}}$. For this reason $X_{|\mathcal{Y}}$ is called the minimum-mean-squared-error
(MMSE) linear estimate of $X$ given $\mathcal{Y}$, and $X_{\perp\mathcal{Y}}$ is called the MMSE estimation error.
Moreover, the orthogonality principle holds: $V \in \bar{\mathcal{Y}}$ is the MMSE linear estimate of $X$ given $\mathcal{Y}$
if and only if $X - V$ is orthogonal to $\mathcal{Y}$.

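Similarly, if $\mathcal{X} \subseteq Q$ is a set of random variables, then by the orthogonality principle the set
$\mathcal{V} \subseteq \bar{\mathcal{Y}}$ is the corresponding set of MMSE linear estimates of $\mathcal{X}$ given $\mathcal{Y}$ if and only if
$\langle \mathcal{X} - \mathcal{V}, \mathcal{Y} \rangle = 0$, or $\langle \mathcal{V}, \mathcal{Y} \rangle = \langle \mathcal{X}, \mathcal{Y} \rangle$. Writing $\mathcal{V}$ as a set of linear combinations of the elements
of $\mathcal{Y}$ in matrix form, namely $\mathcal{V} = A_{xy}\mathcal{Y}$, and defining $R_{xy}$ as the cross-correlation matrix $\langle \mathcal{X}, \mathcal{Y} \rangle$
and $R_{yy}$ as the autocorrelation matrix $\langle \mathcal{Y}, \mathcal{Y} \rangle$, we obtain a unique solution

$$A_{xy} = R_{xy} R_{yy}^{-1},$$

where without loss of generality we assume that $R_{yy}$ is invertible; i.e., that $\mathcal{Y}$ is a linearly
independent generator set for $\bar{\mathcal{Y}}$. In short, an explicit formula for the projection of $\mathcal{X}$ onto $\bar{\mathcal{Y}}$ is

$$\mathcal{X}_{|\mathcal{Y}} = R_{xy} R_{yy}^{-1} \mathcal{Y}.$$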

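The expression $\mathcal{X} = A_{xy}\mathcal{Y} + \mathcal{X}_{\perp\mathcal{Y}}$ shows that $\mathcal{X}$ may be regarded as the sum of a linear estimate
derived from $\mathcal{Y}$ and an independent error (innovations) variable $\mathcal{E} = \mathcal{X}_{\perp\mathcal{Y}}$. This decomposition
is illustrated in the block diagram below.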



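[Block diagram: $\mathcal{Y}$ passes through the filter $A_{xy} = R_{xy} R_{yy}^{-1}$ to give $\mathcal{X}_{|\mathcal{Y}}$, which is added to $\mathcal{X}_{\perp\mathcal{Y}}$ to produce $\mathcal{X}$.]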



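Since $\mathcal{X}_{\perp\mathcal{Y}}$ has zero mean and is independent of $\mathcal{Y}$, we have $\mathrm{E}[\mathcal{X} \mid \mathcal{Y}] = \mathcal{X}_{|\mathcal{Y}}$; i.e., the MMSE
linear estimate $\mathcal{X}_{|\mathcal{Y}}$ is the conditional mean of $\mathcal{X}$ given $\mathcal{Y}$. Indeed, this decomposition shows
that the conditional distribution of $\mathcal{X}$ given $\mathcal{Y}$ is Gaussian with mean $\mathcal{X}_{|\mathcal{Y}}$ and autocorrelation
matrix $R_{ee} = R_{xx} - R_{xy} R_{yy}^{-1} R_{yx}$, by Pythagoras. Thus $\mathcal{X}_{|\mathcal{Y}}$ is evidently the unconstrained MMSE
estimate of $\mathcal{X}$ given $\mathcal{Y}$; i.e., our earlier restriction to a linear estimate is no real restriction.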
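
To make the two formulas above concrete, here is a minimal Python sketch (an illustration under stated assumptions, not part of the original development). It assumes a scalar variable observed twice in independent noise, with real-valued second-order statistics; the function name `mmse_from_two_observations` and the numerical values are hypothetical.

```python
def mmse_from_two_observations(r_xx, r_xy, R_yy):
    """Return (A_xy, r_ee) for a scalar X and two observations Y = (Y1, Y2).

    r_xx: variance of X; r_xy: [<X,Y1>, <X,Y2>]; R_yy: 2x2 Gram matrix of Y.
    Real-valued second-order statistics are assumed to keep the sketch short.
    """
    (a, b), (c, d) = R_yy
    det = a * d - b * c
    if det == 0:
        raise ValueError("R_yy must be invertible (Y linearly independent)")
    inv = [[d / det, -b / det], [-c / det, a / det]]
    # A_xy = R_xy R_yy^{-1}  (row vector times matrix)
    A = [r_xy[0] * inv[0][j] + r_xy[1] * inv[1][j] for j in range(2)]
    # r_ee = r_xx - R_xy R_yy^{-1} R_yx  (for real statistics, R_yx = R_xy^T)
    r_ee = r_xx - (A[0] * r_xy[0] + A[1] * r_xy[1])
    return A, r_ee


# Assumed model: Y1 = X + N1, Y2 = X + N2, with X, N1, N2 independent.
S, N1, N2 = 4.0, 1.0, 2.0
A, r_ee = mmse_from_two_observations(S, [S, S], [[S + N1, S], [S, S + N2]])
# Orthogonality principle: the error X - A.Y is uncorrelated with Y1.
residual_corr_y1 = S - (A[0] * (S + N1) + A[1] * S)
```

For this assumed model the error variance works out to $1/(1/S + 1/N_1 + 1/N_2)$, so each additional observation can only reduce it.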

Moreover, this block diagram implies that the MMSE estimate $\mathcal{X}_{|\mathcal{Y}}$ is a sufficient statistic for
estimation of $\mathcal{X}$ from $\mathcal{Y}$, since $\mathcal{Y} - \mathcal{X}_{|\mathcal{Y}} - \mathcal{X}$ is evidently a Markov chain; i.e., $\mathcal{Y}$ and $\mathcal{X}$ are
conditionally independent given $\mathcal{X}_{|\mathcal{Y}}$. We call this the sufficiency property of the MMSE
estimate. This implies that $\mathcal{X}$ can be estimated as well from the projection $\mathcal{X}_{|\mathcal{Y}}$ as from $\mathcal{Y}$, so
there is no loss of estimation optimality if we first reduce $\mathcal{Y}$ to $\mathcal{X}_{|\mathcal{Y}}$.

Actually, $\mathcal{X}_{|\mathcal{Y}}$ is a minimal sufficient statistic; i.e., $\mathcal{X}_{|\mathcal{Y}}$ is a function of every other sufficient
statistic $f(\mathcal{Y})$. This follows from the fact that the conditional distribution of $\mathcal{X}$ given $f(\mathcal{Y})$ must
be the same as the conditional distribution given $\mathcal{Y}$, which implies that the conditional mean
$\mathcal{X}_{|\mathcal{Y}}$ can be determined from $f(\mathcal{Y})$.

1.6 A bit of information theory



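By the sufficiency property, the MMSE estimate $\mathcal{X}_{|\mathcal{Y}}$ is a function of $\mathcal{Y}$ that satisfies the data
processing inequality of information theory with equality: $I(\mathcal{X}; \mathcal{Y}) = I(\mathcal{X}; \mathcal{X}_{|\mathcal{Y}})$. In other words,
the reduction of $\mathcal{Y}$ to $\mathcal{X}_{|\mathcal{Y}}$ is information-lossless.

Moreover, since $\mathcal{X} = A_{xy}\mathcal{Y} + \mathcal{X}_{\perp\mathcal{Y}}$ is a linear Gaussian channel model with Gaussian input $\mathcal{Y}$,
Gaussian output $\mathcal{X}$, and independent additive Gaussian noise $\mathcal{E} = \mathcal{X}_{\perp\mathcal{Y}}$, we have

$$I(\mathcal{X}; \mathcal{Y}) = h(\mathcal{X}) - h(\mathcal{X} \mid \mathcal{Y}) = h(\mathcal{X}) - h(\mathcal{E}) = \log \frac{|R_{xx}|}{|R_{ee}|},$$

where we recall that the differential entropy of a set $\mathcal{X}$ of $N$ complex Gaussian random variables
with nonsingular autocorrelation matrix $R_{xx}$ is $h(\mathcal{X}) = \log (\pi e)^N |R_{xx}|$. (We assume that $R_{ee}$ is
nonsingular, else $\{\mathcal{X}, \mathcal{Y}\}$ is linearly dependent, so at least one dimension of $\mathcal{X}$ may be determined
precisely from $\mathcal{Y}$ and $I(\mathcal{X}; \mathcal{Y}) = \infty$.)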
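
As a sanity check on the log-determinant formula, the following minimal Python sketch (an illustration, assuming a scalar additive-noise model with signal variance `S` and noise variance `N`, and natural logarithms so that rates are in nats) evaluates the ratio of the signal variance to the MMSE error variance. The function name `mutual_information_scalar` is hypothetical.

```python
import math


def mutual_information_scalar(S, N):
    """I(X;Y) in nats for the assumed model Y = X + noise (complex Gaussian)."""
    r_xx, r_xy, r_yy = S, S, S + N
    # MMSE error variance r_ee = r_xx - r_xy r_yy^{-1} r_yx = S N / (S + N)
    r_ee = r_xx - r_xy * r_xy / r_yy
    return math.log(r_xx / r_ee)


rate = mutual_information_scalar(3.0, 1.0)
capacity = math.log(1.0 + 3.0 / 1.0)  # familiar log(1 + SNR) expression
```

Under these assumptions the two quantities agree, since $S / r_{ee} = (S + N)/N = 1 + S/N$.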



1.7 Chain rule of MMSE estimation 

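Suppose that $\mathcal{X}, \mathcal{Y}, \mathcal{Z}$ are jointly Gaussian sets of random variables and that we wish to estimate
$\mathcal{X}$ based on $\mathcal{Y}$ and $\mathcal{Z}$. The MMSE estimate is then $\mathcal{X}_{|\mathcal{YZ}}$, the projection of $\mathcal{X}$ onto the subspace
$\bar{\mathcal{Y}} + \bar{\mathcal{Z}}$ generated by the variables in both $\mathcal{Y}$ and $\mathcal{Z}$.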

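The subspace $\bar{\mathcal{Y}} + \bar{\mathcal{Z}}$ may be written as the sum of two orthogonal subspaces as follows:

$$\bar{\mathcal{Y}} + \bar{\mathcal{Z}} = \bar{\mathcal{Y}} + \left(\mathcal{Y}^\perp \cap (\bar{\mathcal{Y}} + \bar{\mathcal{Z}})\right),$$

where the second subspace is the one generated by $\mathcal{Z}_{\perp\mathcal{Y}}$.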

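Correspondingly, we may write the projection $\mathcal{X}_{|\mathcal{YZ}}$ as the sum of two orthogonal projections as
follows:

$$\mathcal{X}_{|\mathcal{YZ}} = \mathcal{X}_{|\mathcal{Y}} + (\mathcal{X}_{\perp\mathcal{Y}})_{|\mathcal{Z}_{\perp\mathcal{Y}}}.$$

We call this the chain rule of MMSE estimation. It is illustrated below: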




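[Diagram: $\mathcal{X}_{|\mathcal{YZ}}$ as the sum of the orthogonal components $\mathcal{X}_{|\mathcal{Y}}$ and $(\mathcal{X}_{\perp\mathcal{Y}})_{|\mathcal{Z}_{\perp\mathcal{Y}}}$.]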

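Generalizing, if we wish to estimate $\mathcal{X}$ based on a sequence $\mathcal{Y} = \{\mathcal{Y}_1, \mathcal{Y}_2, \ldots\}$ of random
variables such that $\mathcal{X}$ and $\mathcal{Y}$ are jointly Gaussian, then the chain rule of MMSE estimation
becomes

$$\mathcal{X}_{|\mathcal{Y}} = \mathcal{X}_{|\mathcal{Y}_1} + (\mathcal{X}_{\perp\mathcal{Y}_1})_{|(\mathcal{Y}_2)_{\perp\mathcal{Y}_1}} + \cdots + (\mathcal{X}_{\perp\mathcal{Y}_1^{i-1}})_{|(\mathcal{Y}_i)_{\perp\mathcal{Y}_1^{i-1}}} + \cdots,$$

where $\mathcal{Y}_1^{i-1} = \{\mathcal{Y}_1, \mathcal{Y}_2, \ldots, \mathcal{Y}_{i-1}\}$. The incremental estimate $(\mathcal{X}_{\perp\mathcal{Y}_1^{i-1}})_{|(\mathcal{Y}_i)_{\perp\mathcal{Y}_1^{i-1}}}$ thus represents
the "new information" given by the innovations component $(\mathcal{Y}_i)_{\perp\mathcal{Y}_1^{i-1}}$ of the observation $\mathcal{Y}_i$
about $\mathcal{X}$, given the previous observations $\mathcal{Y}_1^{i-1}$.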

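The innovations representation may be seen as a special case of the chain rule of MMSE
estimation. Indeed, if $\mathcal{X} = \{X_1, X_2, \ldots\}$ and we take $\mathcal{Y} = \mathcal{X}$, then $\mathcal{X}_{|\mathcal{X}} = \mathcal{X}$, and the "new
information" sequence becomes

$$(\mathcal{X}_{\perp\mathcal{X}_1^{i-1}})_{|(X_i)_{\perp\mathcal{X}_1^{i-1}}} = (\mathcal{X}_{\perp\mathcal{X}_1^{i-1}})_{|E_i} = \{0, \ldots, 0, E_i, \ldots\};$$

i.e., the first $i$ components of $(\mathcal{X}_{\perp\mathcal{X}_1^{i-1}})_{|E_i}$ are $\{0, \ldots, 0, E_i\}$, where $E_i = (X_i)_{\perp\mathcal{X}_1^{i-1}}$ is the $i$th
innovation variable of $\mathcal{X}$; the remaining components are evidently linearly dependent on $E_i$.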

2 Successive decoding



Often it is natural or helpful to regard a set $\mathcal{X}$ of Gaussian random variables as a sequence of
subsets, $\mathcal{X} = \{\mathcal{X}_1, \mathcal{X}_2, \ldots\}$. For instance $\mathcal{X}_1, \mathcal{X}_2, \ldots$ might represent a discrete-time sequence,
in which case the ordering naturally follows the time ordering; or, in a multi-user scenario,
$\mathcal{X}_1, \mathcal{X}_2, \ldots$ might represent different users, in which case the ordering may be arbitrary. Thus
the index set $\{1, 2, \ldots\}$ indicates an ordering, but is not necessarily a time index set.

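Our aim will be to signal at a rate approaching the mutual information $I(\mathcal{X}; \mathcal{Y})$. As above,
we may write

$$I(\mathcal{X}; \mathcal{Y}) = I(\mathcal{X}; \mathcal{X}_{|\mathcal{Y}}) = h(\mathcal{X}) - h(\mathcal{E}) = \log \frac{|R_{xx}|}{|R_{ee}|},$$

where $\mathcal{E} = \{\mathcal{E}_1, \mathcal{E}_2, \ldots\}$ is the sequence of estimation error subsets $\mathcal{E}_i = (\mathcal{X}_i)_{\perp\mathcal{Y}}$.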

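We will consider a successive decoding scenario in which the subsets $\mathcal{X}_1, \mathcal{X}_2, \ldots$ are detected
sequentially from a set $\mathcal{Y}$ of observed variables. For each index $i$, we will aim to signal at a rate
approaching the incremental rate

$$R_i = h(\mathcal{X}_i \mid \mathcal{X}_1^{i-1}) - h(\mathcal{E}_i \mid \mathcal{E}_1^{i-1}),$$

where $\mathcal{X}_1^{i-1} = \{\mathcal{X}_1, \mathcal{X}_2, \ldots, \mathcal{X}_{i-1}\}$ and $\mathcal{E}_1^{i-1} = (\mathcal{X}_1^{i-1})_{\perp\mathcal{Y}}$. By the chain rule of differential entropy,
we will then approach a total rate of $\sum_i R_i = h(\mathcal{X}) - h(\mathcal{E}) = I(\mathcal{X}; \mathcal{Y})$.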

For successive decoding, we will make the following critical assumption: 

Ideal decision feedback assumption: In the detection of the variable subset $\mathcal{X}_i$,
the values of the previous variables $\mathcal{X}_1^{i-1}$ are known precisely.

The ideal decision feedback assumption is the decisive break between the classical analog
estimation theory of Wiener et al. and the digital Shannon theory. If the $\mathcal{X}_i$ are continuous
Gaussian variables, then in general it is nonsense to suppose that they can be estimated precisely
(assuming that $\mathcal{X}$ and $\mathcal{Y}$ are not linearly dependent). On the other hand, if the $\mathcal{X}_i$ are codewords
in some discrete code $C$ whose words are chosen randomly according to the Gaussian statistics of
$\mathcal{X}_i$ given $\mathcal{X}_1^{i-1}$, and if the length of $C$ is large enough and the rate of $C$ is less than the incremental
rate $R_i$, then Shannon theory shows that the probability of not decoding $\mathcal{X}_i$ precisely given $\mathcal{Y}$
and $\mathcal{X}_1^{i-1}$ may be driven arbitrarily close to 0. So in a digital coding scenario, the ideal decision
feedback assumption may be quite reasonable.

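The MMSE estimate $(\mathcal{X}_i)_{|\mathcal{Y}\mathcal{X}_1^{i-1}}$ of $\mathcal{X}_i$ is a sufficient statistic for estimation of $\mathcal{X}_i$ given $\mathcal{Y}$ and
$\mathcal{X}_1^{i-1}$. Moreover, by the chain rule of MMSE estimation, we may alternatively write

$$(\mathcal{X}_i)_{|\mathcal{Y}\mathcal{X}_1^{i-1}} = (\mathcal{X}_i)_{|\mathcal{Y}} + ((\mathcal{X}_i)_{\perp\mathcal{Y}})_{|(\mathcal{X}_1^{i-1})_{\perp\mathcal{Y}}} = (\mathcal{X}_i)_{|\mathcal{Y}} + (\mathcal{E}_i)_{|\mathcal{E}_1^{i-1}}.$$

The estimation error is $(\mathcal{E}_i)_{\perp\mathcal{E}_1^{i-1}}$. In short, $\mathcal{X}_i$ is the sum of three independent components: the
MMSE estimate of $\mathcal{X}_i$ given $\mathcal{Y}$, the MMSE prediction of $\mathcal{E}_i$ given $\mathcal{E}_1^{i-1}$, and the estimation error
$(\mathcal{E}_i)_{\perp\mathcal{E}_1^{i-1}}$. The differential entropy of the estimation error may thus be written in any of the
following ways:

$$h((\mathcal{E}_i)_{\perp\mathcal{E}_1^{i-1}}) = h(\mathcal{X}_i \mid \mathcal{Y}, \mathcal{X}_1^{i-1}) = h(\mathcal{E}_i \mid \mathcal{E}_1^{i-1}).$$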

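We note therefore that $\sum_i R_i = I(\mathcal{X}; \mathcal{Y})$ follows alternatively from the chain rule of mutual
information, since

$$I(\mathcal{X}_i; \mathcal{Y} \mid \mathcal{X}_1^{i-1}) = h(\mathcal{X}_i \mid \mathcal{X}_1^{i-1}) - h(\mathcal{X}_i \mid \mathcal{Y}, \mathcal{X}_1^{i-1}) = h(\mathcal{X}_i \mid \mathcal{X}_1^{i-1}) - h(\mathcal{E}_i \mid \mathcal{E}_1^{i-1}) = R_i.$$

Successive decoding then works as follows. The sequence to be decoded is $\mathcal{X}_1, \mathcal{X}_2, \ldots$, and the
observed sequence is $\mathcal{Y}$. We first reduce $\mathcal{Y}$ to the MMSE estimate $(\mathcal{X}_1)_{|\mathcal{Y}}$ and decode $\mathcal{X}_1$ from it,
in the presence of the error $\mathcal{E}_1 = (\mathcal{X}_1)_{\perp\mathcal{Y}}$. If the decoding of $\mathcal{X}_1$ is correct, then we can compute
$\mathcal{E}_1 = \mathcal{X}_1 - (\mathcal{X}_1)_{|\mathcal{Y}}$ and form the estimate $(\mathcal{E}_2)_{|\mathcal{E}_1}$, which we add to $(\mathcal{X}_2)_{|\mathcal{Y}}$ to form the input to a
decoder for $\mathcal{X}_2$ with error $(\mathcal{E}_2)_{\perp\mathcal{E}_1}$, and so forth.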
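
The rate bookkeeping for this procedure can be checked on a small example. The following minimal Python sketch is an illustration under our own assumptions, not part of the original development: it assumes two independent scalar users with variances `S1` and `S2` observed through a single sum channel with noise variance `N`, and natural logarithms. The function name `successive_rates` is hypothetical.

```python
import math


def successive_rates(S1, S2, N):
    """Incremental rates (nats) for the assumed model Y = X1 + X2 + noise,
    decoding X1 first and then X2 with ideal decision feedback."""
    T = S1 + S2 + N                      # variance of Y
    # Error Gram matrix R_ee = R_xx - R_xy R_yy^{-1} R_yx, with R_xy = [S1, S2]^T
    e11 = S1 - S1 * S1 / T
    e12 = -S1 * S2 / T
    e22 = S2 - S2 * S2 / T
    # Cholesky factors (innovation variances) of R_xx and R_ee
    dx1, dx2 = S1, S2                    # X1, X2 independent, so R_xx is diagonal
    de1 = e11
    de2 = e22 - e12 * e12 / e11          # variance of E2 after predicting it from E1
    R1 = math.log(dx1 / de1)
    R2 = math.log(dx2 / de2)
    return R1, R2


R1, R2 = successive_rates(2.0, 3.0, 1.0)
total = math.log(1.0 + (2.0 + 3.0) / 1.0)  # I(X;Y) = log(1 + (S1 + S2)/N)
```

In this assumed model the first user is decoded with the second treated as noise, $R_1 = \log(1 + S_1/(S_2 + N))$, the second is decoded free of interference, $R_2 = \log(1 + S_2/N)$, and the two rates sum to the joint mutual information.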

This "decision feedback" scheme is illustrated in the figure below. The "forward filter" $A_{xy}$ is
the MMSE estimator of the sequence $\mathcal{X}$ given $\mathcal{Y}$. The "backward filter" is the MMSE predictor
of $\mathcal{E}_i$ given $\mathcal{E}_1^{i-1}$, where ideal decision feedback is assumed in computing the previous error $\mathcal{E}_1^{i-1}$.



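[Figure: noise-predictive decision-feedback scheme. $\mathcal{Y}$ passes through the forward filter $A_{xy}$ to give $\mathcal{X}_{|\mathcal{Y}} = \{(\mathcal{X}_1)_{|\mathcal{Y}}, (\mathcal{X}_2)_{|\mathcal{Y}}, \ldots\}$; the backward filter output is added to form $(\mathcal{X}_i)_{|\mathcal{Y}\mathcal{X}_1^{i-1}}$, the input to the decoder for $\mathcal{X}_i$; the difference between the decoded $\mathcal{X} = \{\mathcal{X}_1, \mathcal{X}_2, \ldots\}$ and $\mathcal{X}_{|\mathcal{Y}}$ gives $\mathcal{E} = \{\mathcal{E}_1, \mathcal{E}_2, \ldots\}$, which drives the backward filter.]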

This decision-feedback scheme is said to be in "noise-predictive" form, since the error sequence
$\mathcal{E}$ is predicted by the causal backward filter. By linearity, we can put it into more standard
decision-feedback form as shown below, where the backward filter is denoted by $A_b$:



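[Figure: standard decision-feedback form. $\mathcal{Y}$ passes through the forward filter $A_{xy}$ to give $\mathcal{X}_{|\mathcal{Y}}$, which is combined with feedback of previous decisions through the backward filter $A_b$ to form $(\mathcal{X}_i)_{|\mathcal{Y}\mathcal{X}_1^{i-1}}$, the input to the decoder for $\mathcal{X}_i$.]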

Successive decoding thus breaks the joint detection of $\mathcal{X} = \{\mathcal{X}_1, \mathcal{X}_2, \ldots\}$ into a series of "per-user"
steps. This idea underlies classical decision-feedback schemes for sequential transmission on
a single channel, and also successive interference cancellation schemes on multi-access channels.

Moreover, if we can achieve a small error probability with a code of rate close to $R_i$ for each $i$,
then we can achieve an aggregate rate close to $I(\mathcal{X}; \mathcal{Y})$ with an error probability no greater than
the sum of the component error probabilities, by the union bound. Again, this holds regardless
of the ordering of the users.

In practice, achieving a rate approaching the mutual information will require very long codes.
This is usually not an obstacle in a multi-access scenario. In the case of sequential transmission
on a single channel which is not memoryless, it can be achieved in principle by interleaving
beyond the memory length of the channel. Alternatively, if the channel is
known at the transmitter, then interference may be effectively removed at the transmitter by
various precoding or precancellation schemes.

These schemes naturally extend to infinite jointly stationary and jointly Gaussian sequences
$\mathcal{X} = \{\ldots, \mathcal{X}_0, \mathcal{X}_1, \ldots\}$ and $\mathcal{Y} = \{\ldots, \mathcal{Y}_0, \mathcal{Y}_1, \ldots\}$. The forward and backward filters shown above
become time-invariant in the limit. Cholesky decompositions become multivariate spectral
factorizations. Sequence mutual information quantities such as $I(\mathcal{X}; \mathcal{Y})$ are replaced by information
rates. For a full development, see Guess and Varanasi. The point is that the conceptual basis
of the development is essentially the same.

In summary, when the signal to be detected and the observation are jointly Gaussian, and our
objective is to maximize mutual information, we may always incorporate an MMSE estimator
into the receiver, because an MMSE estimator is a sufficient statistic and thus
information-lossless.